Galaxy for virologist training Exercise 8: Viralrecon

Title: Galaxy

Training dataset: SARS-CoV-2 downsampled sequencing data used to report variants and lineages to Spanish national epidemiologists.
Questions:
  • How many variants do the samples have?
  • Which lineage do the samples belong to?
Objectives:
  • Learn how to run viralrecon in Galaxy's interface
  • Understand the results generated
Estimated time: 1 h 15 min

In this report you will find all the information necessary to follow the steps to analyze SARS-CoV-2 data with Galaxy.

Training overview

During this training we will follow these steps:

  1. Register/Login
  2. Create a new history and name it Viralrecon
  3. Upload data: upload the data for the analysis.
  4. Quality: analysis of the quality of the raw reads.
  5. Trimming: quality trimming using fastp.
  6. Mapping: mapping reads to the reference genome with Bowtie2.
  7. Stats: mapping statistics with samtools and Picard.
  8. Amplicons: preprocessing steps mandatory for amplicon sequencing data.
  9. Variants: variant calling and filtering.
  10. Consensus: consensus genome generation.

From now on, each job we run in Galaxy will have a unique number identifying that process. These numbers can differ depending on the number of samples and on how many times you run or delete a process. This training's snapshots were taken using other samples, and some processes were deleted, so numbers and names MAY DIFFER. However, the steps you have to run are THE SAME.

History

  1. Create a new history by clicking the ➕ icon and name it Viralrecon

Data

We are going to upload files using these URLs, as seen on the first day of the Galaxy tutorial:

https://zenodo.org/record/5724464/files/SARSCOV2-1_R1.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-1_R2.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-2_R1.fastq.gz?download=1
https://zenodo.org/record/5724464/files/SARSCOV2-2_R2.fastq.gz?download=1

Prior to any analysis, we have to download the fasta reference genome using the following URL:

https://zenodo.org/record/5724970/files/GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz?download=1

Also, you will download the BED file of the amplicon primers, which contains the positions of each amplicon primer in the reference genome. Use this URL in the upload window:

https://zenodo.org/record/5724970/files/nCoV-2019.artic.V3.scheme.bed.txt?download=1

Finally, rename and tag the data as follows:

viralrecon_tag

Quality

Quality Analysis (FastQC)

Once we have the raw data, an important step is to analyze the quality of the reads, to determine whether they are usable. To do this, look for "FastQC" in the search bar, then select FastQC Read Quality reports and set the following parameters, same as here:

- Select "Multiple datasets" and select the fastq files R1 and R2 for both samples
- With *Ctrl* select the datasets
- Then go down and select **Execute**

viralrecon_fastqc

FastQC results visualization and interpretation questions

To visualize the information coming from FastQC we just have to select the job of interest. In this case we are interested in the "Web page results", so for each sample whose results we want to see, we have to click on the eye icon to visualize the Galaxy results:

For more information about FastQC output visit the FastQC website

Question

What is the read length? What type of sequencing are we doing?
150 maximum. 2x150 sequencing (paired-end data with 150 bp read length)
How many reads does sample1 have before trimming?
50000
How many reads does sample2 have before trimming?
50000 (this is because we downsampled the data manually; usually samples from the same run do not have the same number of reads)
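The read counts above can be sanity-checked outside Galaxy: a FASTQ record always spans exactly four lines, so the number of reads is the line count divided by four. A minimal sketch (generic, not tied to these particular files):

```python
def count_fastq_reads(lines):
    """Count reads in FASTQ content: each record spans exactly 4 lines
    (header, sequence, '+' separator, qualities)."""
    return sum(1 for _ in lines) // 4

# For a gzipped file such as SARSCOV2-1_R1.fastq.gz you would open it with
# gzip.open(path, "rt") and pass the file handle to count_fastq_reads.
```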

Trimming

Quality trimming (Fastp)

Once we have checked the quality of our reads, it's important to trim low-quality nucleotides from them, for which we will use fastp. In the search bar, look for fastp and then select "fastp - fast all-in-one preprocessing for FASTQ files". There, we will have to change some parameters to ensure the trimming accuracy for this amplicon data. First of all we are going to do the analysis for the sample we gave to you (201569). These are the fields we will have to change:

  1. Search for fastp in the tools and select fastp - fast all-in-one preprocessing for FASTQ files
  2. Select custom parameters:
  3. Finally, click on Execute

fastp1 fastp2 fastp3 fastp4

A message will appear, indicating that 3 types of results will be generated:

  1. Two files, one per sample, with the R1 trimmed reads
  2. Another two, one per sample, with the R2 trimmed reads
  3. Two files, one per sample, with the HTML results

Repeat these steps for the second sample

Fastp results

Once the fastp analysis is done, you can see the results by clicking on the eye icon ("View Data") in the fastp HTML results.

Among the most relevant results are the filtering statistics shown in the HTML report.

For more information about fastp output visit the fastp GitHub

Question

How many reads are we keeping from sample1?
65,056 reads (65.06%)
How many reads did we lose from sample1 and why?
Low quality: 29,092 (29.09%)
Too many Ns: 6 (0%)
Too short: 5,846 (5.85%)
How many reads are we keeping from sample2?
63,318 (63.32%)
How many reads did we lose from sample2 and why?
Low quality: 31,526 (31.53%)
Too many Ns: 4 (0%)
Too short: 5,152 (5.15%)
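These answers can be cross-checked with a little arithmetic: the four fastp categories add up to the total number of input reads (here 100,000 per sample, i.e. 50,000 pairs), and each percentage is the category count over that total. A quick sketch using the sample1 counts reported above:

```python
def filtering_summary(kept, low_quality, too_many_n, too_short):
    """Recompute fastp-style filtering percentages from raw read counts."""
    total = kept + low_quality + too_many_n + too_short
    counts = {"kept": kept, "low_quality": low_quality,
              "too_many_N": too_many_n, "too_short": too_short}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

# sample1 counts reported by fastp in this training (they total 100,000 reads)
summary = filtering_summary(65056, 29092, 6, 5846)
```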

Mapping

In order to call variants between the samples and the reference, it's mandatory to map the sample reads to the reference genome. To do this we need the fasta file of the reference and the Bowtie2 index of that fasta file.

Mapping reads with reference genome (Bowtie2)

Now we can start with the main mapping process. First, look for "Bowtie2" in the search bar and then select "Bowtie2 - map reads against reference genome". Here we will have to set the following parameters for the first sample, same as here:

  1. Is this single or paired library > Paired-end
  2. Fasta/Q file #1: fastp Read 1 output for both samples
  3. Fasta/Q file #2: fastp Read 2 output for both samples
  4. Will you select a reference genome from your history or use a built-in index? > Use a genome from the history and create index
  5. Select reference genome > GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz
  6. Do you want to use presets? > Very sensitive local
  7. Save the bowtie2 mapping statistics to the history > Yes
  8. Execute

bowtie1 bowtie2

Mapping results

Now we can see the mapping results for the samples. The Bowtie2 output is a .bam file, which is not easy for humans to read. This .bam file can be downloaded by clicking on the alignment file and then on Download. The resulting .gz file will contain the alignment .bam file, which can be loaded into software such as IGV together with the reference genome fasta file.

In our case, the file that can be visualized is the mapping stats file, which contains information such as the percentage of reads that aligned.

Question

What is the overall alignment rate for sample1?
99.69%
And the overall alignment rate for sample2?
97.75%
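The overall alignment rate is printed at the end of the Bowtie2 mapping statistics file. If you prefer to extract it programmatically rather than read the log by eye, a small sketch (the line format is Bowtie2's standard "NN.NN% overall alignment rate"):

```python
import re

def overall_alignment_rate(log_text):
    """Extract the 'overall alignment rate' percentage from a Bowtie2 log.
    Returns None if the line is not present."""
    match = re.search(r"([\d.]+)% overall alignment rate", log_text)
    return float(match.group(1)) if match else None
```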

Stats

The previously shown files give little human-readable information, because mapping files are meant to be used by other programs. We can therefore use some programs to extract relevant statistical information about the mapping process.

Samtools flagstat

The first program is Samtools, from which we will use the module samtools flagstat. To do this, look in the search bar for "samtools flagstat" and then select "Samtools flagstat tabulate descriptive stats for BAM dataset". There, we just have to select the samples we want to compute mapping stats for (in the example there are two samples; you just have to use one): Bowtie2 on data X, data X and data X: alignment. You can select the samples from the list in Multiple datasets or select the folder icon (Browse datasets) to pick the file from the history. Finally, select Execute

samtools_flagstat_1

Samtools results

The samtools flagstat output gives information about the number and percentage of reads that mapped to the reference genome.

Question

How many reads mapped against the reference genome for sample1?
64856
And how many for sample2?
61891
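The counts above come from the `mapped` line of the flagstat report, which has the shape `64856 + 0 mapped (99.69% : N/A)` (QC-passed reads + QC-failed reads). A sketch of pulling out that first number:

```python
def mapped_reads(flagstat_text):
    """Return the QC-passed mapped read count from `samtools flagstat` output.
    The relevant line looks like: '64856 + 0 mapped (99.69% : N/A)'."""
    for line in flagstat_text.splitlines():
        if " mapped (" in line:
            return int(line.split(" + ")[0])
    return None
```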

Picard CollectWgsMetrics

Another program that gives statistical information about the mapping process is Picard. To run this program you just have to search "Collect Wgs Metrics" and then select "CollectWgsMetrics compute metrics for evaluating of whole genome sequencing experiments".

You have to change the following parameters:

  1. Select SAM/BAM dataset or dataset collection > Dataset collection > Select both bam files at once
  2. Load reference genome from > History
  3. Select the fasta file we uploaded with the reference genome (NC_045512.2).
  4. Treat bases with coverage exceeding this value as if they had coverage at this value = 1000000
  5. Select validation stringency > Lenient
  6. Execute.

picard_wgsmetrics1

This process will generate one output file per .bam alignment file selected as input.

Picard results

Picard results are quite long files, so it is best to download them and view them on your computer. To do so, click on the CollectWgsMetrics job you want to download, and then click on the save button:

Then you just have to open the file with Excel on your computer, and you will see different columns with information such as the percentage of the reference genome that is covered by the reads at a specific depth, or the mean depth of coverage of the reference genome.

Question

What is the mean coverage for sample1?
156.672207
What percentage of the reference genome is covered at more than 10X by sample1 reads?
78.44%
What is the mean coverage for sample2?
158.232619
What percentage of the reference genome is covered at more than 10X by sample2 reads?
88.18%
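The CollectWgsMetrics report is a tab-separated table preceded by `##` comment lines: one header row (`GENOME_TERRITORY`, `MEAN_COVERAGE`, `PCT_10X`, ...) followed by one row of values. Note that Picard reports `PCT_10X` as a fraction (0.7844 → 78.44%). If you'd rather not open Excel, a parsing sketch:

```python
def parse_picard_metrics(text):
    """Parse the metrics table of a Picard CollectWgsMetrics report:
    skip '#' comment lines, then zip the header row with the values row."""
    rows = [line for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    header, values = rows[0].split("\t"), rows[1].split("\t")
    return dict(zip(header, values))
```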

Amplicons

After mapping the reads to the reference genome, we are interested in removing the sequences of the amplicon primers. To do that you will use a program called iVar, and you will need a bed file with the positions of those amplicon primers.

Trim amplicon sequences

Once you have the bed file, you just have to search for "ivar trim" in the search bar and select "ivar trim Trim reads in aligned BAM". Then follow these steps:

  1. Bam file > Select the alignment bam file generated with Bowtie2 for both samples.
  2. BED file with primer sequences and positions > Select the Amplicon bed file.
  3. Include reads with no primers > Yes.
  4. Minimum length of read to retain after trimming = 20

ivar_trim1

iVar trim results

The resulting file from iVar trim will be a new BAM file with the amplicon primer positions removed, so there's no result to visualize.

Variants

Once we have the alignment statistics and the files with amplicon primers trimmed, we can start the variant calling process.

iVar variants

As a reminder of how iVar processed the reads in the previous step: iVar uses primer positions supplied in a BED file to soft-clip primer sequences from an aligned and sorted BAM file. Following this, the reads are trimmed based on a quality threshold (default: 20). To do the quality trimming, iVar uses a sliding window approach (default window size: 4). The window slides from the 5' end to the 3' end and, if at any point the average base quality in the window falls below the threshold, the remainder of the read is soft-clipped. If, after trimming, the length of the read is greater than the specified minimum length (default: 30), the read is written to the new trimmed BAM file.
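The sliding-window rule described above can be sketched in a few lines. This is an illustrative re-implementation, not iVar's actual code; the window size, quality threshold and minimum length follow the defaults quoted in the text:

```python
def clip_position(quals, window=4, threshold=20):
    """Scan 5'->3'; at the first window whose mean base quality falls below
    the threshold, soft-clip the rest of the read. Returns the kept length."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return i
    return len(quals)

def keep_read(quals, window=4, threshold=20, min_len=30):
    """A read is written to the trimmed BAM only if, after clipping,
    its length is greater than the minimum length."""
    return clip_position(quals, window, threshold) > min_len
```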

  1. Search for ivar variants and select ivar variants Call variants from aligned BAM file
  2. Bam file > Select ivar trimmed bam files for both samples
  3. Minimum frequency threshold > 0.75
  4. Output format > Both tabular and VCF
  5. In VCF only output variants that PASS all filters > Yes

ivar_variants

iVar results

iVar results consist of a VCF file containing all the variants found between the reference and the samples. Each line represents a variant; the columns give information about that variant, such as the position in the reference genome, the reference allele, the alternate allele, whether that variant passed the filters, and so on.

These variants have passed the minimum quality filter, which we set to 20, and the minimum allele frequency of 75%.
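The two thresholds can be expressed as a simple predicate over iVar's tabular output. `ALT_QUAL` and `ALT_FREQ` are real column names in that table; representing a row as a dict here is just for illustration:

```python
def passes_filters(variant, min_qual=20, min_freq=0.75):
    """Keep a variant only if it clears the thresholds used in this
    training: base quality >= 20 and allele frequency >= 0.75."""
    return variant["ALT_QUAL"] >= min_qual and variant["ALT_FREQ"] >= min_freq
```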

Question

How many positions are different (variants) between the reference and sample1 that pass all filters?
40
And how many between reference and sample2 that pass all filters?
40

Annotation with SnpEff

Once we have the variants called, it's interesting to annotate them, for which you will use SnpEff. Search for "snpeff" in the search bar and select "SnpEff eff: annotate variants for SARS-CoV-2", then change the following parameters:

  1. Sequence changes (SNPs, MNPs, InDels) > Select ivar output VCF for both samples
  2. Create CSV report, useful for downstream analysis (-csvStats) > Yes

snpeff

SnpEff results

SnpEff gives three different results, of which the most interesting are:

  1. Snpeff eff: a VCF file with the annotation results. It is very similar to the files we saw before for VarScan and Bcftools, but with a different last column containing relevant annotation information about each variant.

  2. Snpeff eff CSV stats: a CSV file that contains statistics about the variant annotation process, such as the percentage of variants annotated, the percentage of variants that are MISSENSE or SILENT, the percentage that have HIGH, MODERATE or LOW effect, and so on.

Question

How many missense variants does sample1 have?
25
How many INDELs does sample1 have?
3
How many missense variants does sample2 have?
31
How many INDELs does sample2 have?
1

Consensus

Once we have the most relevant variants that can be considered to include in the consensus genome, you can start with the consensus genome generation.

Bcftools consensus

The first step consists of including the called variants in the reference genome, for which you will search for "bcftools consensus" in the search bar and then select "bcftools consensus Create consensus sequence by applying VCF variants to a reference fasta file". In this module you have to select:

  1. VCF/BCF Data > VCF resulting from iVar variants for both samples
  2. Choose the source for the reference genome > Use a genome from the history
  3. Reference genome > Fasta file uploaded at the beginning.
  4. Execute

bcftools_consensus

This will generate a fasta file identical to the reference, except at those nucleotides that are variants in the VCF file.
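Conceptually, applying SNPs to the reference is a per-position substitution. This simplified sketch ignores indels, which bcftools consensus also handles:

```python
def apply_snps(reference, snps):
    """Build a consensus by substituting SNP alternate alleles into the
    reference sequence. `snps` maps 1-based positions to ALT bases."""
    seq = list(reference)
    for pos, alt in snps.items():
        seq[pos - 1] = alt  # VCF positions are 1-based
    return "".join(seq)
```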

Genome coverage calculation

At this point we have the consensus viral genome, but we filtered the variants based on coverage, selecting only those with a coverage depth higher than 10X. So we cannot be sure that, in regions with coverage lower than 10X, the consensus genome is free of variants that we filtered out. The next step is therefore to determine which regions of the reference genome have a coverage lower than 10X.

To do that, search for "bedtools genomecov" in the search bar and select "bedtools Genome Coverage compute the coverage over an entire genome", then select the following options:

  1. Input type > BAM
  2. BAM file > iVar trim output bam files for both samples
  3. Output type > BedGraph coverage file
  4. Report regions with zero coverage > Yes
  5. Execute

bedtools_genomecov

This process will generate a BED file in which each genomic position range of the reference genome has its coverage calculated. In this example you can see that positions 2 to 54 of the reference genome have a coverage of 2X and will therefore be masked later.

bedtools_genomecov_result

Regions filtering

From the file resulting from bedtools Genome Coverage, you are going to select those regions with a coverage lower than 10X. Write "awk" in the search bar, select "Text reformatting with awk", and change:

  1. File to process > Bedtools genome coverage file with the coverage regions for both samples
  2. AWK Program = $4 < 10
  3. Execute

awk

The resulting file is exactly the same as the bedtools Genome Coverage output, but only contains the lines whose genomic region has a coverage lower than 10X.
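The awk program `$4 < 10` keeps a BedGraph line whenever its fourth column (the depth) is below 10. An equivalent sketch in Python:

```python
def low_coverage_regions(bedgraph_text, min_depth=10):
    """Keep only BedGraph lines (chrom, start, end, depth) whose depth is
    below the threshold -- the same filter as the awk program `$4 < 10`."""
    kept = []
    for line in bedgraph_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4 and float(fields[3]) < min_depth:
            kept.append(line)
    return kept
```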

Masking the consensus genome

Now that you have the consensus genome and the regions with a sequencing depth lower than 10X, you are going to "mask" those regions in the consensus genome, replacing their nucleotides with "N"s. Search for "bedtools maskfasta", select "bedtools MaskFastaBed use intervals to mask sequences from a FASTA file" and then select the following parameters:

  1. BED/bedGraph/GFF/VCF/EncodePeak file > Select the BED files resulting from AWK text filter for both samples
  2. FASTA file > Select the consensus genome fasta file generated with Bcftools consensus, for both samples
  3. Execute

bedtools_maskfasta

The resulting file is the consensus genome generated previously, but now containing Ns instead of A, T, G or C in the regions with less than 10X depth of coverage.

bedtools_maskfasta_result
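Masking is again a per-position operation: every base inside a low-coverage interval is replaced with N. A sketch using BED conventions (0-based start, end-exclusive end):

```python
def mask_fasta(sequence, regions):
    """Replace bases in low-coverage regions with 'N'. `regions` holds
    (start, end) pairs in BED coordinates: 0-based, end-exclusive."""
    seq = list(sequence)
    for start, end in regions:
        seq[start:end] = "N" * (end - start)
    return "".join(seq)
```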

You can download this fasta file and upload it to a public repository such as ENA or GISAID. You can also use it to build phylogenetic trees or for any other analysis of the SARS-CoV-2 consensus fasta file.

Lineage

Now we are going to determine the lineage of the samples using a tool called pangolin. We are going to use the masked consensus genomes generated in the previous steps as follows:

  1. Search for the pangolin tool
  2. Select Pangolin Phylogenetic Assignment of Outbreak Lineages and set the following parameters:
  3. Select the bedtools MaskFastaBed generated in the previous step as input fasta file for both samples
  4. Execute

pangolin

Now we are going to have a look at the results from pangolin. As you can see, the results are in table format: the first column holds the sequence name and the next one the lineage assigned.
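The pangolin lineage report is a plain CSV whose first two columns are `taxon` (the sequence name) and `lineage`. A sketch of pulling out the assignments:

```python
import csv
import io

def lineages(report_csv):
    """Map each sequence name (taxon) to its assigned lineage
    from a pangolin lineage report given as CSV text."""
    reader = csv.DictReader(io.StringIO(report_csv))
    return {row["taxon"]: row["lineage"] for row in reader}
```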

Question

What is the lineage of sample1?
B.1.1.7 (Alpha)
And the lineage of sample2?
AY.33 (Delta)

All results

If you have any problems following this training and you want to visualize the resulting files, you can access them through this URL:

https://usegalaxy.eu/u/s.varona/h/viralrecon

And the viralrecon workflow at:

https://usegalaxy.eu/u/s.varona/w/viralrecon2022